This week’s lab is a musical lab. You’ll be requesting data from the Spotify API and using it to build k-nearest neighbor and decision tree models. In order to use the Spotify API, you must have a Spotify account. If you don’t have one, sign up for a free one here: https://www.spotify.com/us/signup.
Once you have an account, go to Spotify for Developers (https://developer.spotify.com/) and log in. Click the green “Create a Client ID” button and fill out the form to create an app so you can access the API. On your developer dashboard page, click on the new app you just created. On the app’s dashboard page you will find your Client ID just under the header name of your app. Click “Show Client Secret” to reveal your client secret. Together, the client ID and client secret are the credentials you’ll use to interact with the API.
You have two options for completing this lab.
Option 1: Classify by users. Build models that predict whether a given song comes from your collection or from a classmate’s. This requires that you are already a Spotify user, so you have enough data to work with. You will download your data from the Spotify API and then exchange it with another member of the class.
Option 2: Classify by genres. Build models that predict which genre a song belongs to. This will use a pre-existing Spotify dataset available from Kaggle (https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify).
#LOAD NECESSARY PACKAGES
library(spotifyr) #API interaction
library(tidyverse) #cleaning data, ggplot, etc
library(tidymodels) #for modeling/statistical analysis
library(rsample) #for splitting data into train / test
library(recipes) #for creating the recipe for ML
#library(skimr) #for data exploration / early summary stats and viz
library(kknn) #for KNN modeling
library(plotly) #for data viz
library(ggpubr) #for data viz
library(here) #for simplifying file path navigation
library(baguette) #for bagging decision trees
library(ranger) # engine for random forests
library(kableExtra) #for creating a formatted table
The Client ID and Client Secret are required to create an access token, which is in turn required to interact with the API. You can set them as environment variables so you don’t have to provide them each time.
#API key stored in token.R, which is not git tracked for security
source(here("token.R"))
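For reference, a token.R file typically just sets the two environment variables that spotifyr looks for and requests a token — a sketch with placeholder values (the environment variable names are the ones spotifyr reads; the credential strings are yours):

```r
# token.R -- keep this file untracked (e.g., list it in .gitignore)
library(spotifyr)

# spotifyr reads these environment variables when authenticating
Sys.setenv(SPOTIFY_CLIENT_ID = "your-client-id-here")
Sys.setenv(SPOTIFY_CLIENT_SECRET = "your-client-secret-here")

# client-credentials token; user-scoped calls like get_my_saved_tracks()
# will additionally open a browser window to authorize your account
access_token <- get_spotify_access_token()
```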
This may result in an error: “INVALID_CLIENT: Invalid redirect URI.” This can be resolved by editing the callback settings on your app. Go to your app and click “Edit Settings”. Under Redirect URIs, paste this: http://localhost:1410/ and click save at the bottom.
You can use get_my_saved_tracks() to request all your liked tracks. It would be good if you had at least 150-200 liked tracks so the model has enough data to work with. If you don’t have enough liked tracks, you can instead use get_my_recently_played(), and in that case grab at least 500 recently played tracks if you can.
The Spotify API returns a dataframe of tracks and associated attributes. However, it will only return up to 50 tracks per request (the default is 20), so you will have to make multiple requests. Use a function to combine all your requests in one call.
#GETTING A DATAFRAME OF LIKED SONGS
#In the get_my_saved_tracks() function, the offset argument specifies the index of the first track to return.
offsets <- seq(from = 0, to = 1000, by = 50)
#initializing an empty data frame to collect the results
liked_tracks <- data.frame(matrix(nrow = 0, ncol = 30))
#Loop to get my 1050 most recently liked tracks, 50 at a time
for (i in seq_along(offsets)) {
  my_tracks <- get_my_saved_tracks(limit = 50,
                                   offset = offsets[i])
  df_temp <- as.data.frame(my_tracks) #temporary data frame holding the 50 liked tracks from this request
  liked_tracks <- rbind(liked_tracks, df_temp) #append this batch to the liked tracks data frame
}
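Growing a data frame with rbind() inside a loop works but gets slow for large collections; a purrr-based sketch of the same pagination (assuming the offsets vector above and an authenticated session) would be:

```r
# Request each page of 50 saved tracks and row-bind the results into one data frame
liked_tracks <- purrr::map_dfr(offsets, ~ get_my_saved_tracks(limit = 50, offset = .x))
```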
Once you have your tracks, familiarize yourself with this initial dataframe. You’ll need to request some additional information for the analysis. If you give get_track_audio_features() a list of track IDs, it will return a dataframe of audio features for all of those tracks.
#obtain a list of the track IDs
ids <- liked_tracks$track.id
#the ids argument in `get_track_audio_features()` can only take 100 IDs at a time, so I'm splitting the IDs into groups of 100
ids_split <- split(ids, ceiling(seq_along(ids) / 100))
#create an empty list
my_tracks_audio_feats_list <- list()
#Iterate the `get_track_audio_features()` function over each ID chunk and store the resulting data frames in a list.
for (i in seq_along(ids_split)) {
  my_tracks_audio_feats_list[[i]] <- get_track_audio_features(ids = ids_split[[i]])
}
#Combine the list of data frames into a single data frame.
my_tracks_audio_feats <- do.call(rbind, my_tracks_audio_feats_list)
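The chunking trick used above — `ceiling(seq_along(x) / n)` as the grouping factor for `split()` — is easiest to see on a toy vector:

```r
# Split 7 items into consecutive groups of at most 3
ids_demo <- letters[1:7]
groups <- ceiling(seq_along(ids_demo) / 3) # 1 1 1 2 2 2 3
split(ids_demo, groups)
# $`1` = "a" "b" "c"; $`2` = "d" "e" "f"; $`3` = "g"
```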
These track audio features are the predictors we are interested in, but this dataframe doesn’t have the actual names of the tracks. Append the ‘track.name’ column from your liked tracks dataframe.
#selecting the track id and track name from liked tracks so I can use left_join and only add the track name to the audio features data set
liked_tracks_join <- liked_tracks %>%
select(track.id, track.name, track.artists) %>%
mutate(primary_artist = unlist(lapply(liked_tracks$track.artists, function(x) x$name[1]))) %>%
select(-track.artists)
#add the track name and artist to the audio features data frame
liked_tracks_full <- left_join(my_tracks_audio_feats, liked_tracks_join, by = c("id" = "track.id"))
#save this data frame as a CSV so it can be shared with others
write_csv(liked_tracks_full, "lewis_liked_tracks.csv")
Find a classmate whose data you would like to use. Add your partner’s data to your dataset. Create a new column that will contain the outcome variable you will try to predict. This variable should take two values indicating whether the track came from your dataset or your partner’s.
# reading in Elke's liked tracks. Add a column to make it clear which tracks are Elke's
elke_liked_tracks <- read_csv(here("elke_liked_tracks.csv")) %>%
mutate(listener = "Elke")
#add a column named "listener" to make it clear that these are my tracks
lewis_liked_tracks <- liked_tracks_full %>%
mutate(listener = "Lewis")
# Combine all of our liked tracks into one data frame
all_tracks <- rbind(lewis_liked_tracks, elke_liked_tracks) %>%
select(-(type:analysis_url))
Let’s take a look at your data. Do some exploratory summary stats and visualization.
For example: What are the most danceable tracks in your dataset? What are some differences in the data between users (Option 1) or genres (Option 2)?
# names(lewis_liked_tracks) #for checking out column names
#my most danceable songs
lewis_liked_tracks %>%
select(danceability, track.name, primary_artist) %>%
arrange(-danceability) %>%
head(5)
## # A tibble: 5 × 3
## danceability track.name primary_artist
## <dbl> <chr> <chr>
## 1 0.984 Conceited Flo Milli
## 2 0.975 CAN'T TOUCH THIS BIA
## 3 0.971 We Not Humping - Remix Monaleo
## 4 0.964 Slumber Party (feat. Princess Nokia) Ashnikko
## 5 0.958 Commando Migos
#my least danceable songs
lewis_liked_tracks %>%
select(danceability, track.name, primary_artist) %>%
arrange(danceability) %>%
head(5)
## # A tibble: 5 × 3
## danceability track.name primary_artist
## <dbl> <chr> <chr>
## 1 0.164 Crying All the Time Alexandra Savior
## 2 0.185 Appointments Julien Baker
## 3 0.205 Get Free Lana Del Rey
## 4 0.215 Funeral Phoebe Bridgers
## 5 0.218 Alone Again The Weeknd
#my most "speechy" songs
lewis_liked_tracks %>%
select(speechiness, track.name, primary_artist) %>%
arrange(-speechiness) %>%
head(5)
## # A tibble: 5 × 3
## speechiness track.name primary_ar…¹
## <dbl> <chr> <chr>
## 1 0.781 Yoga 645AR
## 2 0.635 Big Black Truck (with JID) Dreamville
## 3 0.595 Ghost Of Soulja Slim Jay Electro…
## 4 0.59 That's All I Have Lil Wayne
## 5 0.583 Ghetto Gods Freestyle (with EARTHGANG feat. 2 Chainz) Dreamville
## # … with abbreviated variable name ¹primary_artist
#my least "speechy" songs
lewis_liked_tracks %>%
select(speechiness, track.name, primary_artist) %>%
arrange(speechiness) %>%
head(5)
## # A tibble: 5 × 3
## speechiness track.name primary_artist
## <dbl> <chr> <chr>
## 1 0.0241 Me & My Dog boygenius
## 2 0.0242 Dreams Tonite Alvvays
## 3 0.0251 Vessels Julien Baker
## 4 0.0254 The Archer Alexandra Savior
## 5 0.0254 Notion Tash Sultana
#DT Table for vars of interest
library(DT)
lewis_liked_tracks %>%
select(track.name, primary_artist, danceability, acousticness, speechiness, energy, valence) %>%
datatable()
#Danceability comparison
dance_plot <- ggplot(all_tracks, aes(x = danceability, fill = listener,
text = paste(listener))) +
geom_density(alpha=0.6, color=NA) +
scale_fill_manual(values=c("#b0484f", "#4436d9"))+
labs(x="Danceability", y="Density") +
guides(fill=guide_legend(title="Listener"))+
theme_minimal() +
ggtitle("Distribution of Danceability Data")
#speechiness comparison
speech_plot <- ggplot(all_tracks, aes(x = speechiness, fill = listener,
text = paste(listener))) +
geom_density(alpha=0.6, color=NA) +
scale_fill_manual(values=c("#b0484f", "#4436d9"))+
labs(x="Speechiness", y="Density") +
guides(fill=guide_legend(title="Listener"))+
theme_minimal() +
ggtitle("Distribution of Speechiness Data")
#acousticness comparison
acoustic_plot <- ggplot(all_tracks, aes(x = acousticness, fill = listener,
text = paste(listener))) +
geom_density(alpha=0.6, color=NA) +
scale_fill_manual(values=c("#b0484f", "#4436d9"))+
labs(x="Acousticness", y="Density") +
guides(fill=guide_legend(title="Listener"))+
theme_minimal() +
ggtitle("Distribution of Acousticness Data")
#energy comparison
energy_plot <- ggplot(all_tracks, aes(x = energy, fill = listener,
text = paste(listener))) +
geom_density(alpha=0.6, color=NA) +
scale_fill_manual(values=c("#b0484f", "#4436d9"))+
labs(x="Energy", y="Density") +
guides(fill=guide_legend(title="Listener"))+
theme_minimal() +
ggtitle("Distribution of Energy Data")
ggarrange(dance_plot, speech_plot, acoustic_plot, energy_plot, ncol=2, nrow=2, common.legend = TRUE, legend="bottom")
#My music appears to be higher energy, more danceable, and more "speechy," while Elke's music tends to be more acoustic.
#plotting valence and energy to get a sense for the moods of our liked songs
ggplot(data = all_tracks, aes(x = valence, y = energy, color = listener)) +
geom_point(alpha = 0.5) +
geom_vline(xintercept = 0.5) +
geom_hline(yintercept = 0.5) +
scale_x_continuous(expand = c(0, 0), limits = c(0, 1)) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 1)) +
annotate('text', 0.25 / 2, 0.95, label = "Turbulent/Angry", fontface =
"bold") +
annotate('text', 1.75 / 2, 0.95, label = "Happy/Joyful", fontface = "bold") +
annotate('text', 1.75 / 2, 0.05, label = "Chill/Peaceful", fontface =
"bold") +
annotate('text', 0.25 / 2, 0.05, label = "Sad/Depressing", fontface =
"bold") +
theme_minimal() +
labs(x = "Valence",
y = "Energy",
title = "Plotting songs based on their positivity and energy level",
subtitle = "Elke and I don't have many songs in the Chill/Peaceful quadrant.")
## this plotly allows you to hover over points to see the track and the corresponding artist
plot_ly(data = all_tracks,
x = ~valence,
y = ~energy,
color = ~listener,
colors = c("#D1999C", "#9C8CEC"),
type = "scatter",
mode = "markers",
text = paste("Track Name:", all_tracks$track.name, "<br>",
"Primary Artist:", all_tracks$primary_artist, "<br>",
"Valence:", all_tracks$valence, "<br>",
"Energy:", all_tracks$energy, "<br>",
"Listener:", all_tracks$listener)) %>%
layout(xaxis = list(title = "Valence"),
yaxis = list(title = "Energy"),
hovermode = "closest",
title = "Track Valence vs Energy")
#Playing around with another visualization
all_tracks_summary <- all_tracks %>%
dplyr::group_by(listener) %>%
summarize(danceability = mean(danceability),
energy = mean(energy),
speechiness = mean(speechiness),
acousticness = mean(acousticness),
valence = mean(valence)) %>%
t() %>%
as.data.frame() %>%
rename("Elke" = "V1",
"Lewis" = "V2")
all_tracks_summary <- all_tracks_summary[-1,] %>%
mutate(Elke = as.numeric(Elke),
Elke = round(Elke, 2),
Lewis = as.numeric(Lewis),
Lewis = round(Lewis, 2))
all_tracks_summary
## Elke Lewis
## danceability 0.57 0.65
## energy 0.52 0.58
## speechiness 0.05 0.16
## acousticness 0.43 0.30
## valence 0.39 0.43
library(ggtext)
all_tracks_summary %>%
rownames_to_column() %>%
arrange(Elke) %>%
mutate(rowname = factor(rowname, rowname)) %>%
ggplot(aes(x = rowname, y = Elke)) +
geom_segment(aes(x = rowname ,xend = rowname, y = Lewis, yend = Elke), color = "grey") +
geom_point(size = 5, color = "#b0484f") +
geom_point(aes(y = Lewis), size = 5, color = "#9C8CEC") +
coord_flip() +
theme_minimal() +
theme(
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.text.x = element_text(size = 10, angle = 0, hjust = 1),
axis.text.y = element_text(size = 10),
plot.title = element_markdown(size = 14)
) +
labs(x = "Music Metric",
y = "Music Metric Score",
title = "**Comparing <span style='color:#b0484f'>Elke's</span> Average Music Metric Values with <span style='color:#9C8CEC'>Mine</span>**")
#some quick t-tests comparing my music to Elke's
#danceability
t.test(lewis_liked_tracks$danceability, elke_liked_tracks$danceability, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: lewis_liked_tracks$danceability and elke_liked_tracks$danceability
## t = 10.541, df = 1129.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.06934386 0.10106319
## sample estimates:
## mean of x mean of y
## 0.6519095 0.5667060
#speechiness
t.test(lewis_liked_tracks$speechiness, elke_liked_tracks$speechiness, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: lewis_liked_tracks$speechiness and elke_liked_tracks$speechiness
## t = 21.166, df = 1499.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.09240622 0.11128260
## sample estimates:
## mean of x mean of y
## 0.1567638 0.0549194
#acousticness
t.test(lewis_liked_tracks$acousticness, elke_liked_tracks$acousticness, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: lewis_liked_tracks$acousticness and elke_liked_tracks$acousticness
## t = -7.4094, df = 909.18, p-value = 2.898e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.16341060 -0.09497117
## sample estimates:
## mean of x mean of y
## 0.2989228 0.4281137
#energy
t.test(lewis_liked_tracks$energy, elke_liked_tracks$energy, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: lewis_liked_tracks$energy and elke_liked_tracks$energy
## t = 4.7725, df = 861.12, p-value = 2.137e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.03126572 0.07494569
## sample estimates:
## mean of x mean of y
## 0.5768379 0.5237322
#Each t-test was significant at the 0.001 significance level, clearly suggesting that my music differs from Elke's across the variables tested.
Create two models, a k-nearest neighbor model and a decision tree model, that predict whether a track belongs to your collection or your partner’s. Then validate and compare the performance of the two models you have created.
Make sure to use appropriate resampling to select the best version of each algorithm to compare and some appropriate visualization of your results.
#prepare all_tracks for modeling by removing songs with duplicate values (tough to predict if both Elke and I liked a song), making nominal data a factor, and removing variables that don't make sense to include in the model
all_tracks_modeling <- all_tracks[!duplicated(all_tracks$track.name, fromLast = TRUE) & !duplicated(all_tracks$track.name), ] %>%
mutate_if(is.ordered, .funs = factor, ordered = F) %>%
select(-track.name, -primary_artist) %>%
mutate(listener = as.factor(listener))
#splitting the data
set.seed(123)
#initial split of data ~ we're going with 70/30 because the sample size isn't super large for testing
tracks_split <- initial_split(all_tracks_modeling, prop = .7)
tracks_train <- training(tracks_split)
tracks_test <- testing(tracks_split)
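Since the two listeners may contribute unequal numbers of tracks, a stratified split is one way to keep the listener proportions similar in train and test; a possible variant of the split above (same rsample function, extra strata argument):

```r
# Same 70/30 split, but stratified on the outcome variable
set.seed(123)
tracks_split_strat <- initial_split(all_tracks_modeling, prop = 0.7, strata = listener)
table(training(tracks_split_strat)$listener) # class counts should mirror the full data's ratio
```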
#pre-processing
# We need to create a recipe and do the pre-processing by normalizing the numeric variables (dummy coding is commented out below since the remaining predictors are all numeric).
tracks_recipe <- recipe(listener ~ ., data = tracks_train) %>%
#step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
step_normalize(all_numeric(), -all_outcomes()) %>% #normalize numeric to make sure scale is okay
prep()
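To sanity-check the preprocessing, you can bake the prepped recipe and confirm the numeric predictors were normalized (a quick check using the objects defined above):

```r
# Applying the prepped recipe with new_data = NULL returns the processed training set
baked_train <- bake(tracks_recipe, new_data = NULL)
round(mean(baked_train$danceability), 2) # should be ~0 after step_normalize()
```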
# Define our KNN model with tuning
knn_spec_tune <- nearest_neighbor(neighbors = tune()) %>%
set_mode("classification") %>%
set_engine("kknn")
# Check the model
knn_spec_tune
## K-Nearest Neighbor Model Specification (classification)
##
## Main Arguments:
## neighbors = tune()
##
## Computational engine: kknn
# Define a new workflow
wf_knn_tune <- workflow() %>%
add_model(knn_spec_tune) %>%
add_recipe(tracks_recipe)
# 10-fold CV on the training dataset
set.seed(123)
cv_folds <- tracks_train %>%
vfold_cv(v = 10)
# Fit the workflow on our predefined folds and hyperparameters
fit_knn_cv <- wf_knn_tune %>%
tune_grid(
cv_folds,
grid = data.frame(neighbors = c(1, 5, seq(10, 150, 10))) # try k = 1, k = 5, then k = 10, 20, ..., 150
)
# Check the performance with collect_metrics()
fit_knn_cv %>% collect_metrics() %>% print(n = 32)
## # A tibble: 34 × 7
## neighbors .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 accuracy binary 0.690 10 0.00832 Preprocessor1_Model01
## 2 1 roc_auc binary 0.623 10 0.00967 Preprocessor1_Model01
## 3 5 accuracy binary 0.711 10 0.0109 Preprocessor1_Model02
## 4 5 roc_auc binary 0.727 10 0.00781 Preprocessor1_Model02
## 5 10 accuracy binary 0.756 10 0.00950 Preprocessor1_Model03
## 6 10 roc_auc binary 0.755 10 0.00838 Preprocessor1_Model03
## 7 20 accuracy binary 0.757 10 0.00749 Preprocessor1_Model04
## 8 20 roc_auc binary 0.768 10 0.0108 Preprocessor1_Model04
## 9 30 accuracy binary 0.755 10 0.0106 Preprocessor1_Model05
## 10 30 roc_auc binary 0.776 10 0.0127 Preprocessor1_Model05
## 11 40 accuracy binary 0.756 10 0.0112 Preprocessor1_Model06
## 12 40 roc_auc binary 0.780 10 0.0137 Preprocessor1_Model06
## 13 50 accuracy binary 0.753 10 0.0119 Preprocessor1_Model07
## 14 50 roc_auc binary 0.781 10 0.0148 Preprocessor1_Model07
## 15 60 accuracy binary 0.758 10 0.0131 Preprocessor1_Model08
## 16 60 roc_auc binary 0.782 10 0.0146 Preprocessor1_Model08
## 17 70 accuracy binary 0.755 10 0.0129 Preprocessor1_Model09
## 18 70 roc_auc binary 0.783 10 0.0144 Preprocessor1_Model09
## 19 80 accuracy binary 0.754 10 0.0136 Preprocessor1_Model10
## 20 80 roc_auc binary 0.783 10 0.0146 Preprocessor1_Model10
## 21 90 accuracy binary 0.751 10 0.0126 Preprocessor1_Model11
## 22 90 roc_auc binary 0.784 10 0.0150 Preprocessor1_Model11
## 23 100 accuracy binary 0.751 10 0.0147 Preprocessor1_Model12
## 24 100 roc_auc binary 0.785 10 0.0153 Preprocessor1_Model12
## 25 110 accuracy binary 0.748 10 0.0137 Preprocessor1_Model13
## 26 110 roc_auc binary 0.786 10 0.0155 Preprocessor1_Model13
## 27 120 accuracy binary 0.748 10 0.0135 Preprocessor1_Model14
## 28 120 roc_auc binary 0.786 10 0.0157 Preprocessor1_Model14
## 29 130 accuracy binary 0.745 10 0.0140 Preprocessor1_Model15
## 30 130 roc_auc binary 0.786 10 0.0156 Preprocessor1_Model15
## 31 140 accuracy binary 0.744 10 0.0148 Preprocessor1_Model16
## 32 140 roc_auc binary 0.787 10 0.0156 Preprocessor1_Model16
## # … with 2 more rows
# The final workflow for our KNN model
final_wf <- wf_knn_tune %>%
  finalize_workflow(select_best(fit_knn_cv, metric = "roc_auc")) #explicitly select the best model by roc_auc
# Fitting our final workflow
final_fit <- final_wf %>%
fit(data = tracks_train)
#generating predictions using the model on the test data
tracks_pred <- final_fit %>% predict(new_data = tracks_test)
tracks_pred %>% head()
## # A tibble: 6 × 1
## .pred_class
## <fct>
## 1 Lewis
## 2 Lewis
## 3 Lewis
## 4 Lewis
## 5 Lewis
## 6 Lewis
# Write over 'final_fit' with this last_fit() approach
final_fit <- final_wf %>% last_fit(tracks_split)
# Collect metrics on the test data!
final_fit %>% collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.766 Preprocessor1_Model1
## 2 roc_auc binary 0.825 Preprocessor1_Model1
#new spec, tell the model that we are tuning hyperparams
tree_spec_tune <- decision_tree(
cost_complexity = tune(), #tune() asks R to try a bunch of different parameters.
tree_depth = tune(),
min_n = tune()) %>%
set_engine("rpart") %>%
set_mode("classification")
#create a grid of options for tuning purposes so we can identify the optimal value of hyperparameters
tree_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 5) #grid_regular shows grid of different input tuning options while levels says how many combinations we should try.
tree_grid #125 options
## # A tibble: 125 × 3
## cost_complexity tree_depth min_n
## <dbl> <int> <int>
## 1 0.0000000001 1 2
## 2 0.0000000178 1 2
## 3 0.00000316 1 2
## 4 0.000562 1 2
## 5 0.1 1 2
## 6 0.0000000001 4 2
## 7 0.0000000178 4 2
## 8 0.00000316 4 2
## 9 0.000562 4 2
## 10 0.1 4 2
## # … with 115 more rows
wf_tree_tune <- workflow() %>%
add_recipe(tracks_recipe) %>%
add_model(tree_spec_tune)
#when you add a recipe to a workflow, it will automatically prep and bake when necessary.
#the workflow pulls together the recipe and the model specification, so we can fit on the workflow directly.
#set up k-fold CV. This can be used for all the algorithms. I switch to 5-fold CV (instead of the 10-fold used for KNN above) for computation speed.
listener_cv <- tracks_train %>% vfold_cv(v = 5) #creating 5-fold cross-validation
listener_cv
## # 5-fold cross-validation
## # A tibble: 5 × 2
## splits id
## <list> <chr>
## 1 <split [772/193]> Fold1
## 2 <split [772/193]> Fold2
## 3 <split [772/193]> Fold3
## 4 <split [772/193]> Fold4
## 5 <split [772/193]> Fold5
doParallel::registerDoParallel() #build trees in parallel (this tuning step takes roughly 200 seconds)
#get the results of the tuning
tree_rs <- tune_grid(
  wf_tree_tune, #the workflow already includes the recipe, so no separate formula is needed
  resamples = listener_cv, #resamples to use
  grid = tree_grid, #grid to try
  metrics = metric_set(accuracy) #how to assess which combinations are best
)
tree_rs
## # Tuning results
## # 5-fold cross-validation
## # A tibble: 5 × 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [772/193]> Fold1 <tibble [125 × 7]> <tibble [0 × 3]>
## 2 <split [772/193]> Fold2 <tibble [125 × 7]> <tibble [0 × 3]>
## 3 <split [772/193]> Fold3 <tibble [125 × 7]> <tibble [0 × 3]>
## 4 <split [772/193]> Fold4 <tibble [125 × 7]> <tibble [0 × 3]>
## 5 <split [772/193]> Fold5 <tibble [125 × 7]> <tibble [0 × 3]>
#Plot the tuning results to visualize which hyperparameter values work best
autoplot(tree_rs) + theme_light() + labs(title = "Decision Tree Tuning Plot")
show_best(tree_rs)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estim…¹ mean n std_err .config
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0000000001 15 21 accuracy binary 0.728 5 0.0175 Prepro…
## 2 0.0000000178 15 21 accuracy binary 0.728 5 0.0175 Prepro…
## 3 0.00000316 15 21 accuracy binary 0.728 5 0.0175 Prepro…
## 4 0.000562 15 21 accuracy binary 0.728 5 0.0175 Prepro…
## 5 0.0000000001 11 21 accuracy binary 0.727 5 0.0208 Prepro…
## # … with abbreviated variable name ¹.estimator
select_best(tree_rs)
## # A tibble: 1 × 4
## cost_complexity tree_depth min_n .config
## <dbl> <int> <int> <chr>
## 1 0.0000000001 15 21 Preprocessor1_Model071
#finalize the model with the best hyperparameters
final_tree <- finalize_model(tree_spec_tune, select_best(tree_rs))
#fit the final model on the training data and evaluate it on the test data
final_tree_fit <- last_fit(final_tree, listener ~ ., tracks_split)
final_tree_fit$.predictions
## [[1]]
## # A tibble: 415 × 6
## .pred_Elke .pred_Lewis .row .pred_class listener .config
## <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 0.859 0.141 1 Elke Lewis Preprocessor1_Model1
## 2 0.3 0.7 3 Lewis Lewis Preprocessor1_Model1
## 3 0.164 0.836 4 Lewis Lewis Preprocessor1_Model1
## 4 0.0811 0.919 7 Lewis Lewis Preprocessor1_Model1
## 5 0.0667 0.933 12 Lewis Lewis Preprocessor1_Model1
## 6 0.829 0.171 14 Elke Lewis Preprocessor1_Model1
## 7 0.0414 0.959 15 Lewis Lewis Preprocessor1_Model1
## 8 0.818 0.182 21 Elke Lewis Preprocessor1_Model1
## 9 0.571 0.429 22 Elke Lewis Preprocessor1_Model1
## 10 0.0667 0.933 27 Lewis Lewis Preprocessor1_Model1
## # … with 405 more rows
set.seed(123)
bag_spec_tune <- bag_tree(cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()) %>%
set_mode("classification") %>%
set_engine("rpart", times = 50) #50 trees in a bag
#create tuning grid for hyperparameters
bag_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 5)
#make workflow for bagging
wf_bag_tune <- workflow() %>%
add_recipe(tracks_recipe) %>%
add_model(bag_spec_tune)
doParallel::registerDoParallel() #build trees in parallel (this tuning step takes roughly 200 seconds)
#find results of the tuning process
bag_rs <- tune_grid(
  wf_bag_tune, #use the workflow so the recipe's preprocessing is applied during resampling
  resamples = listener_cv, #resamples to use
  grid = bag_grid,
  metrics = metric_set(accuracy) #how to assess which combinations are best
)
#Plot the tuning results to visualize which hyperparameter values work best
autoplot(bag_rs) + theme_light() + labs(title = "Bagged Decision Tree Tuning Plot")
show_best(bag_rs)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estim…¹ mean n std_err .config
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00000316 15 11 accuracy binary 0.769 5 0.00921 Prepro…
## 2 0.00000316 15 21 accuracy binary 0.765 5 0.0145 Prepro…
## 3 0.00000316 8 11 accuracy binary 0.764 5 0.0112 Prepro…
## 4 0.0000000001 11 2 accuracy binary 0.763 5 0.00933 Prepro…
## 5 0.0000000001 8 11 accuracy binary 0.762 5 0.00819 Prepro…
## # … with abbreviated variable name ¹.estimator
select_best(bag_rs)
## # A tibble: 1 × 4
## cost_complexity tree_depth min_n .config
## <dbl> <int> <int> <chr>
## 1 0.00000316 15 11 Preprocessor1_Model048
#finalize the model with the best hyperparameters
final_bag <- finalize_model(bag_spec_tune, select_best(bag_rs))
#fit the final model on the training data and evaluate it on the test data
final_bag_fit <- last_fit(final_bag, listener ~ ., tracks_split)
#checking out the predictions
final_bag_fit$.predictions
## [[1]]
## # A tibble: 415 × 6
## .pred_Elke .pred_Lewis .row .pred_class listener .config
## <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 0.365 0.635 1 Lewis Lewis Preprocessor1_Model1
## 2 0.274 0.726 3 Lewis Lewis Preprocessor1_Model1
## 3 0.413 0.587 4 Lewis Lewis Preprocessor1_Model1
## 4 0.311 0.689 7 Lewis Lewis Preprocessor1_Model1
## 5 0.320 0.680 12 Lewis Lewis Preprocessor1_Model1
## 6 0.613 0.387 14 Elke Lewis Preprocessor1_Model1
## 7 0.0541 0.946 15 Lewis Lewis Preprocessor1_Model1
## 8 0.446 0.554 21 Lewis Lewis Preprocessor1_Model1
## 9 0.494 0.506 22 Lewis Lewis Preprocessor1_Model1
## 10 0.404 0.596 27 Lewis Lewis Preprocessor1_Model1
## # … with 405 more rows
Go through the modeling process for each model:
* Preprocessing. You can use the same recipe for all the models you create.
* Resampling. Make sure to use appropriate resampling to select the best version created by each algorithm.
* Tuning. Find the best values for each hyperparameter (within a reasonable range).
Compare the performance of the four final models you have created.
Use appropriate performance evaluation metric(s) for this classification task. A table would be a good way to display your comparison. Use at least one visualization illustrating your model results.
set.seed(123)
#setting up the random forest specification
forest_spec_tune <-
rand_forest(min_n = tune(),
mtry = tune(),
trees = 140) %>% # 14 predictors * 10, as suggested by "Hands on Machine Learning in R" by Bradley Boehmke & Brandon Greenwell. Tuning this value led to an extremely long computation time.
set_engine("ranger") %>%
set_mode("classification")
#create grid for tuning min_n and the mtry value
forest_grid <- grid_regular(mtry(c(1,13)), min_n(), levels = 5)
#create a workflow for the random forests
wf_forest_tune <- workflow() %>%
add_recipe(tracks_recipe) %>%
add_model(forest_spec_tune)
doParallel::registerDoParallel() #build forest in parallel
#get results of the tuning to try and find optimal model
forest_rs <- tune_grid(
  wf_forest_tune, #use the workflow so the recipe's preprocessing is applied during resampling
  resamples = listener_cv, #resamples to use
  grid = forest_grid,
  metrics = metric_set(accuracy) #how to assess which combinations are best
)
#Plot the tuning results to visualize which hyperparameter values work best
autoplot(forest_rs) + theme_light() + labs(title = "Random Forest Tuning Plot")
#check the tuning results for the random forest
show_best(forest_rs)
select_best(forest_rs)
#finalize the model with the best hyperparameters
final_forest <- finalize_model(forest_spec_tune, select_best(forest_rs))
#fit the final model on the training data and evaluate it on the test data
final_forest_fit <- last_fit(final_forest, listener ~ ., tracks_split)
#check out the predictions and metrics
final_forest_fit$.predictions
## [[1]]
## # A tibble: 415 × 6
## .pred_Elke .pred_Lewis .row .pred_class listener .config
## <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 0.446 0.554 1 Lewis Lewis Preprocessor1_Model1
## 2 0.286 0.714 3 Lewis Lewis Preprocessor1_Model1
## 3 0.389 0.611 4 Lewis Lewis Preprocessor1_Model1
## 4 0.3 0.7 7 Lewis Lewis Preprocessor1_Model1
## 5 0.257 0.743 12 Lewis Lewis Preprocessor1_Model1
## 6 0.564 0.436 14 Elke Lewis Preprocessor1_Model1
## 7 0.0429 0.957 15 Lewis Lewis Preprocessor1_Model1
## 8 0.4 0.6 21 Lewis Lewis Preprocessor1_Model1
## 9 0.446 0.554 22 Lewis Lewis Preprocessor1_Model1
## 10 0.457 0.543 27 Lewis Lewis Preprocessor1_Model1
## # … with 405 more rows
final_forest_fit$.metrics
## [[1]]
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.764 Preprocessor1_Model1
## 2 roc_auc binary 0.809 Preprocessor1_Model1
#cleaning model metrics to create a table with each model's performance metrics
forest_results_table <- final_forest_fit %>% collect_metrics() %>%
select(.metric, .estimate) %>%
mutate(method = "Random Forest")
bagging_results_table <- final_bag_fit %>% collect_metrics() %>%
select(.metric, .estimate) %>%
mutate(method = "Bagging")
decision_tree_results_table <- final_tree_fit %>% collect_metrics() %>%
select(.metric, .estimate) %>%
mutate(method = "Decision Tree")
knn_results_table <- final_fit %>% collect_metrics() %>%
select(.metric, .estimate) %>%
mutate(method = "KNN")
majority_class_results_table <- data.frame (.metric = c("accuracy", "roc_auc"),
.estimate = c(nrow(lewis_liked_tracks) / (nrow(lewis_liked_tracks) + nrow(elke_liked_tracks)), .5),
method = c("Dummy Classifier", "Dummy Classifier")
)
full_results <- bind_rows(majority_class_results_table,
knn_results_table,
decision_tree_results_table,
bagging_results_table,
forest_results_table) %>%
select(method, .estimate, .metric)
#table of results
full_results %>%
kbl(caption = "Table 1. Classification Methods and their Respective Accuracies") %>%
kable_classic(full_width = T, html_font = "Cambria")
| method | .estimate | .metric |
|---|---|---|
| Dummy Classifier | 0.6774194 | accuracy |
| Dummy Classifier | 0.5000000 | roc_auc |
| KNN | 0.7662651 | accuracy |
| KNN | 0.8254070 | roc_auc |
| Decision Tree | 0.7180723 | accuracy |
| Decision Tree | 0.7515946 | roc_auc |
| Bagging | 0.7518072 | accuracy |
| Bagging | 0.7981592 | roc_auc |
| Random Forest | 0.7638554 | accuracy |
| Random Forest | 0.8089996 | roc_auc |
#performance comparison plot
ggplot(data = full_results, aes(x = method, y = .estimate, fill = .metric)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  theme_minimal() +
  scale_fill_manual(values = c("#9fd182", "#7798c9")) +
  labs(y = "Metric Estimate",
       x = "Classification Method",
       fill = "Evaluation Metric",
       title = "KNN Classification Performed Best Across Both Evaluation Metrics")